Resources

Introduction to RStudio and Project management

Project management

RStudio is an integrated development environment (IDE). It provides a (much prettier) interface for the R software. R is integrated into RStudio, so you never actually have to open R.

RStudio lets you create projects: self-contained working spaces (i.e. working directories) that R refers to when looking for and saving files. You can create a project in an existing directory (folder) or in a new one.

We’re going to create a project in RStudio in the existing directory:

  • File
  • New Project
  • Existing directory
  • Browse for the directory you created when downloading the data: gdc
  • Create project

Organising project/working directory

This is one suggestion for how to organise your R project. Your data folder is already there. Let’s go ahead and create the other folders.
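
If you prefer the console to the Files pane, the folders can also be created with code. This is only a sketch: the folder names are the ones this lesson uses later (data, data_output, fig_output, scripts), and the loop is an illustration rather than a required step.

```r
# Folder names used later in this lesson; adjust them to your own layout.
folders <- c("data", "data_output", "fig_output", "scripts")

for (folder in folders) {
  # showWarnings = FALSE keeps dir.create() quiet if the folder already exists
  dir.create(folder, showWarnings = FALSE)
}

# Check that every folder now exists in the working directory
dir.exists(folders)
```

Run it from the project root, otherwise the folders end up wherever your current working directory is.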

Two main ways to interact with R

  • Test and play within the interactive R console (like a chat)
    • Pros: immediate results
    • Cons: work is lost once we close RStudio
  • Write in an .R file (like an email)
    • The file is still executed in the console
    • Pros: complete record of what you did!
    • Cons: can be messy if we just want to print things out

Running the code

  • In the console: Enter
  • In the script:
    • Ctrl + Enter (for Mac users: Command + Enter)
    • Run button at the top right of the script pane: runs the current line or selection

Creating a script

We’re going to work with a script. Let’s create one now and save it in the scripts directory.

  • File
  • New File
  • R Script
  • A new Untitled script will appear in the source pane. Save it using the floppy disk icon.
  • Name it intro-to-r.R

Packages

Much of R’s power lies in packages: add-on sets of functions that are built by the community. Once they pass a quality review, they are available to download from a repository called CRAN. Packages need to be explicitly loaded before use. We will be using the tidyverse package, which is actually a collection of useful packages. Another package that will be useful for us is here.

If you have not installed these packages earlier, please do so now. You can check whether a package is installed in the Packages pane in the bottom-right window.

# install.packages('tidyverse')
install.packages('here')

You need to install a package only once, but you need to load it each time you want to use its functionality. To do that, use the library() command:

library(tidyverse)
library(here)

Handling paths

Credit: kaggle.com

You have created a project, which is your working directory, and a number of subfolders that will help you organise it better. But now, each time you save or retrieve a file from those folders, you need to specify the path from the folder you are in (most likely scripts).

That can become complicated and can become a reproducibility problem if the person using your code (e.g. future you) is working in a different subfolder.

here() to the rescue! This package provides absolute paths from the root (main directory) of your project.

Credit: Allison Horst

here('data')
## [1] "C:/Users/awilczynski/Desktop/R Cafe/geospatial-data-carpentry-tud-2022-11/data"
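
here() also accepts several path components and joins them for you. A small sketch, assuming the data folder and the file name used later in this lesson:

```r
library(here)

# Build a path to a file inside the data folder, starting from the project root
csv_path <- here("data", "gapminder_data.csv")
csv_path

# The same call works from any script in the project,
# no matter which subfolder the script is saved in
file.exists(csv_path)  # TRUE only once the file has been downloaded
```

Because the path starts at the project root, you do not need to adjust it when you move a script between subfolders.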

Download files

We still need to download the data for the first part of the workshop. You can do this with the function download.file(). We will save it in the data folder, where the raw data should go.

download.file('https://bit.ly/GeospatialGapminder', here('data','gapminder_data.csv'), mode = 'wb')

Intro to R

Use R as a calculator

1+100
## [1] 101

Variables and assignment

We can store values in variables using the assignment operator <-, like this:

x <- 1/40

Notice that assignment does not print a value. Instead, we stored it for later in something called a variable. x now contains the value 0.025:

x
## [1] 0.025

Look for the Environment tab in one of the panes of RStudio, and you will see that x and its value have appeared. Our variable x can be used in place of a number in any calculation that expects a number, e.g. when calculating a square root:

sqrt(x)
## [1] 0.1581139

Variables can also be reassigned:

x <- 100
x
## [1] 100

You can also use the value stored in one variable when creating another:

y <- sqrt(x) # you can use value stored in object x to create y
y
## [1] 10

Data Structures

Vectors

So far we’ve looked at individual values. Now we will move to a data structure called a vector. Vectors are arrays of values of the same data type (explained in a moment).

You can create a vector with the c() function.

numeric_vector <- c(2, 6, 3) # vector of numbers - numeric data type.
numeric_vector
## [1] 2 6 3
character_vector <- c('banana', 'apple', 'orange') # vector of words - more precisely strings of characters- character data type
character_vector
## [1] "banana" "apple"  "orange"
logical_vector <- c(TRUE, FALSE, TRUE) # vector of logical values (is something true or false?)- logical data type.
logical_vector
## [1]  TRUE FALSE  TRUE
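
To see what "the same data type" means in practice, you can ask R for the type of each vector with class(). The sketch below also mixes types on purpose; mixed_vector is a made-up name used only for this illustration.

```r
numeric_vector <- c(2, 6, 3)
character_vector <- c("banana", "apple", "orange")
logical_vector <- c(TRUE, FALSE, TRUE)

class(numeric_vector)    # "numeric"
class(character_vector)  # "character"
class(logical_vector)    # "logical"

# Mixing types forces R to convert every element to one common type
mixed_vector <- c(2, "apple", TRUE)
class(mixed_vector)      # "character"
mixed_vector             # "2" "apple" "TRUE"
```

R does this conversion silently, so it is worth checking the class of a vector when a calculation behaves unexpectedly.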

Combining vectors

The combine function, c(), will also append things to an existing vector:

ab_vector <- c('a', 'b')
ab_vector
## [1] "a" "b"
abcd_vector <- c(ab_vector, 'c', 'd')
abcd_vector
## [1] "a" "b" "c" "d"

Missing values

A common operation is to remove all the missing values (denoted in R as NA). Let’s have a look at how to do it:

with_na <- c(1, 2, 1, 1, NA, 3, NA) # vector including missing values

First, let’s try to calculate the mean of the values in this vector:

mean(with_na) # mean() function cannot interpret the missing values
## [1] NA
mean(with_na, na.rm = TRUE) # You can add the argument na.rm = TRUE to calculate the result while ignoring the missing values.
## [1] 1.6

However, sometimes you would like to have the NAs completely removed from your vector. For this, you need to identify which elements of the vector hold missing values with the is.na() function.

is.na(with_na) # This produces a vector of logical values, stating whether the statement 'This element of the vector is a missing value' is true or not
## [1] FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE
!is.na(with_na) # The ! operator means negation, i.e. not is.na(with_na)
## [1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE

We know which elements in the vector are NA. Now we need to retrieve the subset of the with_na vector that is not NA. Subsetting in R is done with square brackets [ ].

without_na <- with_na[!is.na(with_na)] # this notation will return only the elements that have TRUE on their respective positions

without_na
## [1] 1 2 1 1 3
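
Before removing anything, it is often useful to know how many values are missing and where they are. A short sketch using the same vector; sum() works here because R counts TRUE as 1 and FALSE as 0.

```r
with_na <- c(1, 2, 1, 1, NA, 3, NA)

sum(is.na(with_na))    # number of missing values: 2
which(is.na(with_na))  # positions of the missing values: 5 7
```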

Factors (adapted from Starting with Data)

Another important data structure is called a factor. Factors look like character data, but are used to represent categorical information.

Factors create a structured relation between the different levels (values) of a categorical variable, such as days of the week or responses to a question in a survey. While factors look (and often behave) like character vectors, they are actually treated as integer vectors by R. So you need to be very careful when treating them as strings.

Create factors

Once created, factors can only contain a pre-defined set of values, known as levels.

nordic_str <- c('Norway', 'Sweden', 'Norway', 'Denmark', 'Sweden')
nordic_str # regular character vectors printed out
## [1] "Norway"  "Sweden"  "Norway"  "Denmark" "Sweden"
nordic_cat <- factor(nordic_str) # factor() function converts a vector to factor data type
nordic_cat # With factors, R prints out additional information - 'Levels'
## [1] Norway  Sweden  Norway  Denmark Sweden 
## Levels: Denmark Norway Sweden

Inspect factors

R will treat each unique value from a factor vector as a level and (silently) assign numerical values to it. This comes in handy when performing statistical analysis. You can inspect and adapt the levels of a factor.

levels(nordic_cat) # returns all levels of a factor vector.  
## [1] "Denmark" "Norway"  "Sweden"
nlevels(nordic_cat) # returns number of levels in a vector
## [1] 3

Reorder levels

Note that R sorts the levels in alphabetical order, not in the order of occurrence in the vector. R assigns the value 1 to the level ‘Denmark’, 2 to ‘Norway’ and 3 to ‘Sweden’. This is important as it can affect e.g. the order in which categories are displayed in a plot or which category is taken as a baseline in a statistical model.

You can reorder the categories using the factor() function.
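
You can make these hidden integer codes visible with as.integer(). The sketch below rebuilds the factor under a separate, made-up name (nordic_demo) so that it does not interfere with the objects used in the lesson.

```r
nordic_demo <- factor(c("Norway", "Sweden", "Norway", "Denmark", "Sweden"))

levels(nordic_demo)        # "Denmark" "Norway" "Sweden"
as.integer(nordic_demo)    # underlying codes: 2 3 2 1 3
as.character(nordic_demo)  # the labels, as an ordinary character vector
```

If you need the text of a factor, convert it with as.character() first; as.integer() gives you the level codes, not the labels.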

nordic_cat <- factor(nordic_cat, levels = c('Norway' , 'Denmark', 'Sweden')) # now Norway should be the first category, Denmark second and Sweden third

nordic_cat
## [1] Norway  Sweden  Norway  Denmark Sweden 
## Levels: Norway Denmark Sweden
str(nordic_cat) # you can also inspect vectors with the str() function. For factor vectors, it shows the underlying values of each category. You can also see the structure in the Environment tab of RStudio.
##  Factor w/ 3 levels "Norway","Denmark",..: 1 3 1 2 3

There is more than one way to reorder factors. Later in the lesson, we will use the fct_reorder() function from the forcats package to do the reordering.

Note of caution

Remember that once created, factors can only contain a pre-defined set of values, known as levels. This means that whenever you try to add something to the factor vector outside of this set, it will become an unknown/missing value, denoted by R as NA.

nordic_str
## [1] "Norway"  "Sweden"  "Norway"  "Denmark" "Sweden"
nordic_cat2 <- factor(nordic_str, levels = c('Norway', 'Denmark'))
nordic_cat2 # since we have not included Sweden in the list of factor levels, it has become NA.
## [1] Norway  <NA>    Norway  Denmark <NA>   
## Levels: Norway Denmark

Exploring Data frames

Now we turn to the bread-and-butter of working with R: working with tabular data. In R, tabular data are stored in a data structure called a data frame.

A data frame is the representation of data in the format of a table where the columns are vectors that all have the same length.

Because columns are vectors, each column must contain a single type of data (e.g., characters, integers, factors). For example, here is a figure depicting a data frame comprising a numeric, a character, and a logical vector.
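
The same structure can be built by hand with data.frame(). This is a sketch with made-up values; fruit_df and its column names are hypothetical and not part of the workshop data.

```r
# Made-up example data: one numeric, one character and one logical column
fruit_df <- data.frame(
  quantity = c(2, 6, 3),
  fruit = c("banana", "apple", "orange"),
  in_stock = c(TRUE, FALSE, TRUE)
)

fruit_df       # prints the table
str(fruit_df)  # shows the type of each column
```

Every column has three elements; data.frame() stops with an error if the column lengths cannot be matched.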

Reading data

read.csv() is a function used to read comma-separated data files (.csv format). There are other functions for files separated with other delimiters. We’re going to read in the gapminder data set, which contains information about countries’ population, GDP and average life expectancy in different years.

gapminder <- read.csv(here('data','gapminder_data.csv') )

Exploring dataset

Let’s investigate the gapminder data frame a bit; the first thing we should always do is check out what the data looks like.

It is important to see if all the variables (columns) have the data type that we require. Otherwise we can run into trouble.

str(gapminder) 
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

We can see that the gapminder object is a data.frame with 1704 observations (rows) and 6 variables (columns). In each line after a $ sign, we see the name of each column, its type and the first few values.

There are multiple ways to explore a data set. Here are just a few examples:

head(gapminder) # see the first 6 rows of the data set
##       country year      pop continent lifeExp gdpPercap
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007
## 4 Afghanistan 1967 11537966      Asia  34.020  836.1971
## 5 Afghanistan 1972 13079460      Asia  36.088  739.9811
## 6 Afghanistan 1977 14880372      Asia  38.438  786.1134
summary(gapminder) # gives basic statistical information about each column. The information format differs by data type.
##    country               year           pop             continent        
##  Length:1704        Min.   :1952   Min.   :6.001e+04   Length:1704       
##  Class :character   1st Qu.:1966   1st Qu.:2.794e+06   Class :character  
##  Mode  :character   Median :1980   Median :7.024e+06   Mode  :character  
##                     Mean   :1980   Mean   :2.960e+07                     
##                     3rd Qu.:1993   3rd Qu.:1.959e+07                     
##                     Max.   :2007   Max.   :1.319e+09                     
##     lifeExp        gdpPercap       
##  Min.   :23.60   Min.   :   241.2  
##  1st Qu.:48.20   1st Qu.:  1202.1  
##  Median :60.71   Median :  3531.8  
##  Mean   :59.47   Mean   :  7215.3  
##  3rd Qu.:70.85   3rd Qu.:  9325.5  
##  Max.   :82.60   Max.   :113523.1

When you’re analyzing a data set, you often need to access its specific columns.

One handy way to access a column is using its name and a dollar sign $:

country_vec <- gapminder$country  # Notation means: From dataset gapminder, give me column country. You can see that the column accessed in this way is just a vector of characters. 

head(country_vec)
## [1] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan"
## [6] "Afghanistan"

Note that calling a column with the $ sign returns a vector; it is not a data frame anymore.
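
Square brackets give you both options. The sketch below uses a tiny made-up data frame (toy_df is a hypothetical name) to compare the three notations:

```r
# Made-up data for illustration only
toy_df <- data.frame(
  country = c("Norway", "Sweden"),
  year = c(2002L, 2007L)
)

toy_df$country        # a character vector
toy_df[["country"]]   # the same vector, with the column name given as a string
toy_df["country"]     # still a data frame, with a single column

is.data.frame(toy_df["country"])  # TRUE
```

Use single brackets when the next step expects a data frame, and $ or double brackets when it expects a vector.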

Data frame Manipulation with dplyr

Select

Let’s start manipulating the data.

First we will adapt our data set by keeping only the columns we’re interested in, using the select() function from the dplyr package:

year_country_gdp <- select(gapminder, year, country, gdpPercap) 

head(year_country_gdp)
##   year     country gdpPercap
## 1 1952 Afghanistan  779.4453
## 2 1957 Afghanistan  820.8530
## 3 1962 Afghanistan  853.1007
## 4 1967 Afghanistan  836.1971
## 5 1972 Afghanistan  739.9811
## 6 1977 Afghanistan  786.1134

Pipe

Now, this is not the most common notation when working with the dplyr package. dplyr offers an operator %>%, called a pipe, which allows you to build up complicated commands in a readable way.

In newer installations of R (version 4.1 and later) you can also find the notation |>. This pipe works in much the same way; the main difference is that you don’t need to load any packages to have it available.

The select() statement with a pipe looks like this:

year_country_gdp <- gapminder %>% 
  select(year, country, gdpPercap)

head(year_country_gdp)
##   year     country gdpPercap
## 1 1952 Afghanistan  779.4453
## 2 1957 Afghanistan  820.8530
## 3 1962 Afghanistan  853.1007
## 4 1967 Afghanistan  836.1971
## 5 1972 Afghanistan  739.9811
## 6 1977 Afghanistan  786.1134

First we define the data set, then we use the pipe to pass it on to the select() function. This way we can chain multiple functions together, which we will be doing now.
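
To see what a pipe does, it helps to compare it with the nested notation on a very small example. The sketch below uses the native pipe |>, so it assumes R 4.1 or newer and needs no packages; values is a made-up vector.

```r
values <- c(1, 4, 9)

# Nested notation: read from the inside out
sum(sqrt(values))

# Piped notation: read from left to right
values |> sqrt() |> sum()

# Both give the same result
identical(sum(sqrt(values)), values |> sqrt() |> sum())  # TRUE
```

The pipe passes whatever is on its left as the first argument of the function on its right, which is why the data set does not have to be repeated inside select().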

Filter

We already know how to select only the needed columns. Now we also want to filter the data set by a certain condition with the filter() function. Instead of doing it in separate steps, we can do it all together.

In the gapminder data set, we want to see the results only for Europe in the 21st century.

year_country_gdp_euro <- gapminder %>% 
  filter(continent == "Europe" & year > 2000) %>%
  select(year, country, gdpPercap)

head(year_country_gdp_euro)
##   year country gdpPercap
## 1 2002 Albania  4604.212
## 2 2007 Albania  5937.030
## 3 2002 Austria 32417.608
## 4 2007 Austria 36126.493
## 5 2002 Belgium 30485.884
## 6 2007 Belgium 33692.605

Let’s now find all the observations from Eurasia:

year_country_gdp_eurasia <- gapminder %>% 
  filter(continent == "Europe" | continent == "Asia") %>%
  select(year, country, gdpPercap)

head(year_country_gdp_eurasia)
##   year     country gdpPercap
## 1 1952 Afghanistan  779.4453
## 2 1957 Afghanistan  820.8530
## 3 1962 Afghanistan  853.1007
## 4 1967 Afghanistan  836.1971
## 5 1972 Afghanistan  739.9811
## 6 1977 Afghanistan  786.1134

Challenge: Write a single command (which can span multiple lines and include pipes) that will produce a data frame that has the African values for life expectancy, country and year, but not for other continents. How many rows does your data frame have and why?

Group and summarize

So far, we have created a dataset for one of the continents represented in the gapminder dataset. But rather than doing that, we want to know statistics about all of the continents, presented by group.

gapminder %>% # select the dataset
  group_by(continent) %>% # group by continent
  summarize(avg_gdpPercap = mean(gdpPercap)) # summarize function creates statistics for the data set 
## # A tibble: 5 × 2
##   continent avg_gdpPercap
##   <chr>             <dbl>
## 1 Africa            2194.
## 2 Americas          7136.
## 3 Asia              7902.
## 4 Europe           14469.
## 5 Oceania          18622.

Challenge: Calculate the average life expectancy per country. Which country has the longest average life expectancy and which has the shortest average life expectancy?

Hint: Use the max() and min() functions to find the maximum and minimum.

You can also group by multiple columns:

gapminder %>%
  group_by(continent, year) %>%
  summarize(avg_gdpPercap = mean(gdpPercap))
## # A tibble: 60 × 3
## # Groups:   continent [5]
##    continent  year avg_gdpPercap
##    <chr>     <int>         <dbl>
##  1 Africa     1952         1253.
##  2 Africa     1957         1385.
##  3 Africa     1962         1598.
##  4 Africa     1967         2050.
##  5 Africa     1972         2340.
##  6 Africa     1977         2586.
##  7 Africa     1982         2482.
##  8 Africa     1987         2283.
##  9 Africa     1992         2282.
## 10 Africa     1997         2379.
## # … with 50 more rows

On top of this, you can also make multiple summaries of those groups:

gdp_pop_bycontinents_byyear <- gapminder %>%
  group_by(continent,year) %>%
  summarize(
    avg_gdpPercap = mean(gdpPercap),
    sd_gdpPercap = sd(gdpPercap),
    avg_pop = mean(pop),
    sd_pop = sd(pop),
    n_obs = n()
    )
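
When you group by more than one column, the result of summarize() stays grouped by the first column, and dplyr prints a message about it. If you do not need the grouping afterwards, you can drop it with the .groups argument. A sketch on made-up numbers (toy_gdp is hypothetical):

```r
library(dplyr)

# Made-up data for illustration only
toy_gdp <- data.frame(
  continent = c("Europe", "Europe", "Asia", "Asia"),
  year = c(2007, 2007, 2007, 2007),
  gdpPercap = c(30000, 40000, 2000, 4000)
)

toy_summary <- toy_gdp %>%
  group_by(continent, year) %>%
  summarize(
    avg_gdpPercap = mean(gdpPercap),
    n_obs = n(),
    .groups = "drop"  # return an ungrouped table
  )

toy_summary
```

An ungrouped result avoids surprises when you later apply mutate() or filter() to the summary table.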

Frequencies

If you only need the number of observations per group, you can use the count() function:

gapminder %>%
    group_by(continent) %>%
    count()
## # A tibble: 5 × 2
## # Groups:   continent [5]
##   continent     n
##   <chr>     <int>
## 1 Africa      624
## 2 Americas    300
## 3 Asia        396
## 4 Europe      360
## 5 Oceania      24
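
count() can also do the grouping itself if you pass it the column name, which saves the group_by() step. A sketch on a made-up data frame (toy_obs is hypothetical):

```r
library(dplyr)

# Made-up data for illustration only
toy_obs <- data.frame(
  continent = c("Europe", "Asia", "Europe", "Africa")
)

# Equivalent to group_by(continent) followed by count(),
# except that the result is not grouped
toy_counts <- toy_obs %>%
  count(continent)

toy_counts
```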

Mutate

Frequently you’ll want to create new columns based on the values in existing columns, for example to do unit conversions, or to find the ratio of values in two columns. For this we’ll use mutate().

gapminder_gdp <- gapminder %>%
  mutate(gdpBillion = gdpPercap*pop/10^9)

head(gapminder_gdp)
##       country year      pop continent lifeExp gdpPercap gdpBillion
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453   6.567086
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530   7.585449
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007   8.758856
## 4 Afghanistan 1967 11537966      Asia  34.020  836.1971   9.648014
## 5 Afghanistan 1972 13079460      Asia  36.088  739.9811   9.678553
## 6 Afghanistan 1977 14880372      Asia  38.438  786.1134  11.697659

Introduction to Visualisation

The ggplot2 package is a powerful plotting system. I will introduce the key features of ggplot. Later today or on Monday you will use this package to visualize geospatial data. gg stands for grammar of graphics, the idea that three components are needed to create a graph:

  • data
  • aesthetics: the coordinate system on which we map the data (what is represented on the x axis, what on the y axis)
  • geometries: the visual representation of the data (points, bars, etc.)

The fun part about ggplot is that you can then add additional layers to the plot, providing more information and making it more beautiful.

First, let’s plot the distribution of life expectancy in the gapminder data set.

ggplot(data = gapminder, aes(x = lifeExp)) + # aesthetics layer
  geom_histogram() # geometry layer

You can see that in ggplot you use + as a pipe to add layers. Within a ggplot call, it is the only pipe that will work. However, it is possible to chain operations on a data set with a pipe that we have already learned, %>% (or |>), and follow them with the ggplot grammar.
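
When you run a histogram without further arguments, ggplot2 picks a default number of bins and prints a message suggesting you choose a better value. One way to do that is the binwidth argument of geom_histogram(). The sketch below uses a few made-up life expectancy values (toy_life is hypothetical), and the width of 5 years is an arbitrary choice:

```r
library(ggplot2)

# Made-up data for illustration only
toy_life <- data.frame(
  lifeExp = c(43.8, 51.5, 60.7, 65.2, 72.4, 78.9, 80.2)
)

toy_hist <- ggplot(data = toy_life, aes(x = lifeExp)) +
  geom_histogram(binwidth = 5)  # each bar covers 5 years of life expectancy

toy_hist
```

Try a few different widths: bins that are too wide hide the shape of the distribution, while bins that are too narrow make it look noisy.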

Let’s create another plot, this time only on a subset of observations:

gapminder %>%  # we select a dataset
  filter(year == 2007, 
         continent == 'Americas') %>% # and filter it to keep only one year and one continent
  ggplot(aes(x = country, y = gdpPercap)) + # we create aesthetics, both x and y axis represent values of  columns
  geom_col() # we select a column graph as a geometry

Now, you can iteratively improve how the plot looks. For example, you might want to flip it, to better display the labels.

gapminder %>%  
  filter(year == 2007, 
         continent == 'Americas') %>% 
  ggplot(aes(x = country, y = gdpPercap)) + 
  geom_col()+ 
  coord_flip()

One thing you might want to change here is the order in which the countries are displayed. It would be easier to compare GDP per capita if they were shown in order. To do that, we need to reorder the factor levels (remember, we’ve done this before). The order of the levels will depend on another variable: GDP per capita.

gapminder %>%  
  filter(year == 2007, 
         continent == 'Americas') %>% 
  mutate(country = fct_reorder(country, gdpPercap )) %>%
  ggplot(aes(x = country , y = gdpPercap)) + 
  geom_col() +
  coord_flip()

Let’s make things more colorful and represent the average life expectancy of a country by color:

gapminder %>%  
  filter(year == 2007, 
         continent == 'Americas') %>% 
  mutate(country = fct_reorder(country, gdpPercap )) %>%
  ggplot(aes(x = country, y = gdpPercap, fill = lifeExp   )) + # fill argument for coloring surfaces, color for points and lines
  geom_col()+ 
  coord_flip()

We can also adapt the color scale. A common choice, used because it is colorblind-friendly, is the viridis color scale.

plot_c <-
  gapminder %>%  
  filter(year == 2007, 
         continent == 'Americas') %>% 
  mutate(country = fct_reorder(country, gdpPercap )) %>%
  ggplot(aes(x = country, y = gdpPercap, fill = lifeExp   )) + 
  geom_col()+ 
  coord_flip()+
  scale_fill_viridis_c() # _c stands for continuous scale

Maybe we don’t need that much information about the life expectancy. We only want to know if it’s below or above average.

plot_d <-  # this time let's save the plot in an object.
  gapminder %>%  
  filter(year == 2007, 
         continent == 'Americas') %>% 
  mutate(country_reordered = fct_reorder(country, gdpPercap ),
         lifeExpCat = if_else(lifeExp >= mean(lifeExp), 'high', 'low' )
         ) %>%
  ggplot(aes(x = country_reordered, y = gdpPercap, fill = lifeExpCat   )) + 
  geom_col()+ 
  coord_flip()+
  scale_fill_manual(values = c('light blue', 'orange')) 

Since we saved a plot as an object, nothing has been printed out. Just like with any other object in R, if you want to see it, you need to call it.

plot_d

Now we can make use of the saved object and add things to it.

Let’s also give it a title and name the axes:

plot_d <- 
  plot_d +
  ggtitle('GDP per capita in Americas', subtitle = 'Year 2007') +
  xlab('Country')+
  ylab('GDP per capita')

plot_d

Writing data

Once we are happy with our plot we can save it in a format of our choice. Remember to save it in the dedicated folder.

ggsave(plot = plot_d, filename = here('fig_output','plot_americas_2007.pdf') ) # By default, ggsave() saves the last displayed plot, but you can also explicitly name the plot you want to save
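
ggsave() picks the file format from the file extension, and you can set the size of the saved figure with the width, height and units arguments. A sketch with a made-up plot (toy_plot is hypothetical); the sizes are arbitrary and tempfile() only stands in for a path such as here('fig_output', ...):

```r
library(ggplot2)

# Made-up data and plot for illustration only
toy_data <- data.frame(country = c("A", "B", "C"), value = c(3, 1, 2))
toy_plot <- ggplot(toy_data, aes(x = country, y = value)) +
  geom_col()

# The .png extension tells ggsave() to write a PNG file
out_file <- tempfile(fileext = ".png")

ggsave(
  filename = out_file,
  plot = toy_plot,
  width = 15, height = 10, units = "cm",  # size of the saved figure
  dpi = 300                               # resolution for raster formats
)

file.exists(out_file)
```

Setting the size explicitly makes the figure reproducible; otherwise ggsave() uses the size of the current plotting window.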

Another output of your work that you may want to save is a cleaned data set. In your analysis, you can then load that data set directly. Say we want to save the data only for the Americas in 2007:

gapminder_amr_2007 <-  gapminder %>%  
  filter(year == 2007, 
         continent == 'Americas') %>% 
  mutate(country_reordered = fct_reorder(country, gdpPercap ),
         lifeExpCat = if_else(lifeExp >= mean(lifeExp), 'high', 'low' )) 

write.csv(gapminder_amr_2007, 
          here('data_output', 'gapminder_americas_2007.csv'), 
          row.names=FALSE)
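
To check that a file was written the way you expect, you can read it back in. The sketch below does a full round trip on a made-up data frame (toy_out is hypothetical), with tempfile() standing in for a path inside data_output:

```r
# Made-up data for illustration only
toy_out <- data.frame(
  country = c("A", "B"),
  gdpPercap = c(1000, 2000)
)

out_csv <- tempfile(fileext = ".csv")
write.csv(toy_out, out_csv, row.names = FALSE)

# Read the file back and compare it with what we saved
toy_back <- read.csv(out_csv)
names(toy_back)  # "country" "gdpPercap"
nrow(toy_back)   # 2
```

With row.names = FALSE the file contains only your columns. Without it, write.csv() adds the row names as an extra first column, which then shows up as an unwanted column when you read the file again.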